A frequent pattern is a substructure that appears frequently in a dataset. Finding the frequent patterns of a dataset is an essential step in data mining tasks such as feature extraction, and a necessary ingredient of association rule learning. Algorithms of this kind are extremely useful in Market Basket Analysis, which in turn provides retailers with invaluable information about their customers' shopping habits and needs.
Here, I briefly describe the GraphLab Create frequent pattern mining toolkit, the tools it provides and its functionality. A major advantage of this high-level ML toolkit is the ease with which one can train an association rule mining algorithm, as well as the high interpretability of the returned results. Under the hood, the GLC frequent pattern mining toolkit runs a TFP-Growth algorithm, introduced by Wang, Jianyong, et al. in 2005. For a recent review of the various directions in the field, consult Han, Jiawei, et al. "Frequent pattern mining: current status and future directions." Data Mining and Knowledge Discovery 15.1 (2007): 55-86.
In [1]:
import graphlab as gl
from graphlab import aggregate as agg
from visualization_helper_functions import *
Here we discuss a simple example of receipt data from a bakery. The dataset consists of items like 'ApplePie' and 'GanacheCookie', and the task is to identify sets of items that are frequently bought together. The dataset contains 266209 rows and 6 columns, which look like the following. It was constructed by modifying the Extended BAKERY dataset.
In [2]:
bakery_sf = gl.SFrame('./bakery_sf')
bakery_sf
Out[2]:
As we can see below, all the coffee products have similar sale frequencies, and there is no particular subset of products that is clearly preferred over the rest.
In [3]:
%matplotlib inline
item_freq_plot(bakery_sf, 'Item', ndigits=3, topk=30,
seaborn_style='whitegrid', seaborn_palette='deep', color='b')
Next, we split the bakery_sf data set into a training part and a test part.
In [4]:
(train, test) = bakery_sf.random_split(0.8, seed=1)
In [5]:
print 'Number of Rows in training set [80pct of Known Examples]: %d' % train.num_rows()
print 'Number of Rows in test set [20pct of Known Examples]: %d' % test.num_rows()
In order to run a frequent pattern mining algorithm, we require an item column (the column 'Item' in this example) and a set of feature columns that uniquely identify a transaction (the columns ['Receipt', 'StoreNum'] in this example, since we need to take into account the geographical location of each store and any accompanying socio-economic factors).
In addition, we need to specify the 3 basic parameters of the FP-Growth algorithm which is called by the high-level GraphLab Create (GLC) function. These are:
- min_support: The minimum number of times that a pattern must occur in order to be considered frequent. Here, we choose a threshold of 1‰ of the total transactions on record as the min_support.
- max_patterns: The maximum number of frequent patterns to be mined.
- min_length: The minimum size (number of elements in the set) of each pattern being mined.
In [6]:
min_support = int(train.num_rows()*0.001)
model = gl.frequent_pattern_mining.create(train, 'Item',
features=['Receipt', 'StoreNum'],
min_support=min_support,
max_patterns=500,
min_length=4)
Here, we obtain the most frequent feature patterns.
In [7]:
print 'The most frequent feature patterns are:'
print '-----------------------------------------'
model.frequent_patterns.print_rows(max_column_width=80, max_row_width=90)
Note that the 'pattern' column contains the patterns that occur frequently together, whereas the 'support' column contains the number of times these patterns occur together in the entire dataset. In this example, the pattern [CoffeeEclair, HotCoffee, ApplePie, AlmondTwist] occurred 877 times in the training data.
Definition
A frequent pattern is a set of items with a support greater than a user-specified minimum support threshold.
However, there is significant redundancy in mining frequent patterns; every subset of a frequent pattern is also frequent (e.g. 'CoffeeEclair' must be frequent if ['CoffeeEclair', 'HotCoffee'] is frequent). The frequent pattern mining toolkit avoids this redundancy by mining the closed frequent patterns, i.e. frequent patterns with no superset of the same support. This is achieved by the very design of the TFP-Growth algorithm.
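To make the notion of a closed pattern concrete, here is a minimal pure-Python sketch (not the toolkit's internal algorithm) that checks which itemsets in a toy collection of frequent patterns are closed; the item names and support counts are hypothetical.

# Toy collection of frequent patterns with hypothetical support counts.
frequent = {
    frozenset(['CoffeeEclair']): 4000,
    frozenset(['HotCoffee']): 3500,
    frozenset(['CoffeeEclair', 'HotCoffee']): 3500,
}

def is_closed(pattern, support, frequent):
    # A pattern is closed if no proper superset has the same support.
    return not any(pattern < other and support == other_support
                   for other, other_support in frequent.items())

for pattern, support in frequent.items():
    status = 'closed' if is_closed(pattern, support, frequent) else 'not closed'
    print sorted(pattern), support, status

In this toy example, ['HotCoffee'] is not closed: its superset ['CoffeeEclair', 'HotCoffee'] has exactly the same support, so (hypothetically) every basket with a HotCoffee also contains a CoffeeEclair, and reporting the superset alone loses no information.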
Note that by relaxing the min_length requirement, one can obtain more frequent patterns of sold coffee products.
In [8]:
min_support = int(train.num_rows()*0.001)
model = gl.frequent_pattern_mining.create(train, 'Item',
features=['Receipt', 'StoreNum'],
min_support=min_support,
max_patterns=500,
min_length=3)
In [9]:
print 'The most frequent feature patterns are:'
print '-----------------------------------------'
model.frequent_patterns.print_rows(num_rows=35, max_column_width=80, max_row_width=90)
Alternatively, by decreasing the min_support, one can obtain more patterns of sold coffee products, which are again considered frequent, but with respect to this new threshold.
In [10]:
min_support = int(train.num_rows()*(1e-04))
model = gl.frequent_pattern_mining.create(train, 'Item',
features=['Receipt', 'StoreNum'],
min_support=min_support,
max_patterns=500,
min_length=4)
In [11]:
print 'The most frequent feature patterns are:'
print '-----------------------------------------'
model.frequent_patterns.print_rows(num_rows=60, max_row_width=90, max_column_width=80)
To see some details of the trained model:
In [12]:
print model
In practice, we rarely know the appropriate min_support threshold to use. As an alternative to specifying a minimum support, we can specify a maximum number of patterns to mine using the max_patterns parameter. Instead of mining all patterns above a minimum support threshold, we mine the most frequent patterns until the maximum number of closed patterns is found. For large data sets, this mining process can be time-consuming, so we recommend specifying a somewhat large initial minimum support bound to speed up the mining.
In [13]:
min_support = int(train.num_rows()*1e-03)
top5_freq_patterns = gl.frequent_pattern_mining.create(train, 'Item',
features=['Receipt', 'StoreNum'],
min_support=min_support,
max_patterns=5,
min_length=4)
The top-5 most frequent patterns are:
In [14]:
print top5_freq_patterns
We can always save the trained model by calling:
In [15]:
top5_freq_patterns.save('./top5_freq_patterns_model')
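A saved model can later be reloaded with GraphLab Create's generic model loader; a minimal sketch:

# Reload the previously saved frequent pattern mining model from disk.
loaded_model = gl.load_model('./top5_freq_patterns_model')
print loaded_model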
An association rule is an ordered pair of item sets (prefix $ A $, prediction $ B $), denoted $ A\Rightarrow B $, such that $ A $ and $ B $ are disjoint whereas $ A\cup B $ is frequent. The most popular criterion for scoring association rules is the confidence of the rule: the ratio of the support of $ A\cup B $ to the support of $ A $,
$$ \textrm{Confidence}(A\Rightarrow B) = \frac{\textrm{Supp}(A\cup B)}{\textrm{Supp}(A)}. $$
The confidence of the rule $ A\Rightarrow B $ is our empirical estimate of the conditional probability of $ B $ given $ A $.
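As a quick worked example with hypothetical counts: suppose the prefix {HotCoffee, ApplePie} appears in 900 transactions and the joint itemset {HotCoffee, ApplePie, AlmondTwist} appears in 600 of them.

# Hypothetical support counts for a worked confidence computation.
supp_A = 900          # transactions containing {HotCoffee, ApplePie}
supp_A_union_B = 600  # of those, transactions also containing AlmondTwist
confidence = float(supp_A_union_B) / supp_A
print 'Confidence({HotCoffee, ApplePie} => {AlmondTwist}) = %.2f' % confidence
# prints 0.67: about two thirds of the baskets with a HotCoffee and an
# ApplePie also contain an AlmondTwist (hypothetical numbers).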
One can make predictions using the predict() or predict_topk() method, for single and multiple predictions respectively. The output of both methods is an SFrame holding the predicted rules together with their scores.
In [16]:
predictions = top5_freq_patterns.predict(test)
In [17]:
predictions.print_rows(max_row_width=100)
The extract_features() method
Using the set of closed patterns, we can convert pattern data to binary feature vectors. These feature vectors can be used for other machine learning tasks, such as clustering or classification. For each input pattern x, the j-th extracted feature f_{x}[j] is a binary indicator of whether the j-th closed pattern is contained in x.
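Before calling the toolkit's method, here is a minimal pure-Python sketch of the encoding itself, using hypothetical closed patterns and a hypothetical transaction:

# Hypothetical closed patterns and a hypothetical transaction.
closed_patterns = [
    ['CoffeeEclair', 'HotCoffee'],
    ['ApplePie', 'AlmondTwist'],
    ['GanacheCookie', 'HotCoffee'],
]
transaction = ['CoffeeEclair', 'HotCoffee', 'ApplePie']

# f_x[j] = 1 if the j-th closed pattern is fully contained in the transaction.
f_x = [1 if set(pattern).issubset(transaction) else 0
       for pattern in closed_patterns]
print f_x  # -> [1, 0, 0]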
First, we train the top100_freq_patterns model as shown below:
In [18]:
top100_freq_patterns = gl.frequent_pattern_mining.\
create(train, 'Item',
features=['Receipt', 'StoreNum'],
                                           # the pattern must occur at least once in our data record
                                           min_support=1,
                                           # do not search for more than 100 patterns
                                           max_patterns=100,
                                           # the test data contain transactions with only one coffee product per receipt;
                                           # we search for patterns of at least 2 coffee products
                                           min_length=2)
Here are the 100 unique closed patterns which were found to be frequent:
In [19]:
top100_freq_patterns.frequent_patterns.\
print_rows(num_rows=100, max_row_width=90, max_column_width=80)
Next, we apply the extract_features() method of this newly trained top100_freq_patterns model on the train data set.
In [20]:
features = top100_freq_patterns.extract_features(train)
Once the features are extracted, we can use them downstream in other applications such as clustering, classification, churn prediction, recommender systems, etc.
In [21]:
features.print_rows(num_rows=10, max_row_width=90, max_column_width=100)
As a downstream example, we first recover the EmpId associated with each transaction by grouping the training data on the transaction-identifying columns:
In [22]:
emps = train.groupby(['Receipt', 'StoreNum'],
{'EmpId': agg.SELECT_ONE('EmpId')})
emps
Out[22]:
Next, we count how many times each of the top100_freq_patterns occurs per EmpId.
In [23]:
emp_space = emps.join(features).\
groupby('EmpId', {'all_features': agg.SUM('extracted_features')})
emp_space
Out[23]:
Finally, we train a k-means model that partitions the employees into 3 clusters.
In [24]:
cl_model = gl.kmeans.create(emp_space,
features = ['all_features'],
num_clusters=3)
In [25]:
emp_space['cluster_id'] = cl_model['cluster_id']['cluster_id']
emp_space
Out[25]:
And we can provide a countplot of the number of employees per cluster ID, as shown below.
In [26]:
%matplotlib inline
segments_countplot(emp_space, x='cluster_id',
                   figsize_tuple=(12,7), title='Number of Employees per Cluster ID')